Authors / Contributors: Baid, Ami; Chen, Changan; Grauman, Kristen; Harwath, David; Hsu, Wei-Ning; Peng, Puyuan; Xue, Zihui
Generating realistic audio for human actions is critical for applications such as film sound effects and virtual reality games. Existing methods assume complete correspondence between video and audio during training, but in real-world settings, many sounds occur off-screen or correspond only weakly to the visuals, leading to uncontrolled ambient sounds or hallucinations at test time. This paper introduces AV-LDM, a novel ambient-aware audio generation model that disentangles foreground action sounds from ambient background noise in in-the-wild training videos. The approach leverages a retrieval-augmented generation framework to synthesize audio that aligns both semantically and temporally with the visual input. Trained and evaluated on the Ego4D and EPIC-KITCHENS datasets, along with the newly introduced Ego4D-Sounds dataset (1.2M curated clips with action-audio correspondence), the model outperforms prior methods, enables controllable ambient sound generation, and shows promise for generalization to synthetic video game clips. This work is the first to emphasize faithful video-to-audio generation focused on observed visual content despite noisy, uncurated training data.
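The retrieval-augmented idea mentioned in the abstract can be illustrated with a toy sketch: given a query video embedding, find the nearest training clip and reuse its ambient-audio features as an extra conditioning signal for generation. Everything below (names, embedding sizes, the cosine-similarity retriever) is a hypothetical illustration, not the paper's actual AV-LDM implementation.

```python
import numpy as np

def cosine_retrieve(query, keys, k=1):
    """Return indices of the k keys most similar to query (cosine similarity)."""
    q = query / np.linalg.norm(query)
    K = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = K @ q                      # cosine similarity against every key
    return np.argsort(-sims)[:k]      # top-k indices, most similar first

# Toy "training set": video-clip embeddings paired with ambient-audio features
# (shapes chosen arbitrarily for the sketch).
rng = np.random.default_rng(0)
db_video = rng.normal(size=(100, 16))     # 100 clips, 16-dim video embeddings
db_ambient = rng.normal(size=(100, 8))    # matching 8-dim ambient features

# Query: a slightly perturbed copy of clip 42's embedding.
query_video = db_video[42] + 0.01 * rng.normal(size=16)
idx = cosine_retrieve(query_video, db_video, k=1)

# The retrieved ambient features would condition the audio generator,
# alongside the semantic/temporal video features.
retrieved_ambient = db_ambient[idx[0]]
```

In a real system the embeddings would come from a learned audio-visual encoder and the retrieved features would be fed to a latent diffusion model as conditioning; the sketch only shows the retrieval step.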